Subset Selection for Gaussian Markov Random Fields 



Satyaki Mahalanabis*and Daniel Stefankovic 

{ smahalan , stef anko} @cs . rochester.edu 
University of Rochester 



September 27, 2012 



Abstract 

Given a Gaussian Markov random field, we consider the problem of selecting a subset 
of variables to observe which minimizes the total expected squared prediction error of the 
unobserved variables. We first show that finding an exact solution is NP-hard even for a 
restricted class of Gaussian Markov random fields, called Gaussian free fields, which arise in 
semi-supervised learning and computer vision. We then give a simple greedy approximation 
algorithm for Gaussian free fields on arbitrary graphs. Finally, we give a message passing 
algorithm for general Gaussian Markov random fields on bounded tree-width graphs. 



1 Introduction 



Given the joint distribution of a set of random variables (in the form of a Markov random 
field), we consider the problem of selecting a small subset of these variables to observe so as 
to acc urately predict the remaining unobserved variables. We focus here on Gaussian pro- 



cesses (jRasmussen and Williams! . 120061 ) on graphs, i.e., Gaussian Markov random fields (Gaussian 
MRFs). Our aim in this paper is to give a subset selection algorithm which, given a budget for 
the number of variables that can be observed, minimizes the expected squared prediction error 
averaged over all the variables. We are particularly interested in algorithms with provable guar- 
antees on the prediction error. Our main focus is on Gaussian MRFs on trees and other tree-like 
graphs, or to be precise, bounde d tree-width grap hs — such graphs have been widely studied in 



the context of inference, see, e.g., Sudderthl ( 20021 ). Wc also consider a special class of Gaussian 



MR Fs, called Gaus sian free fields (or GFFs), which arise, among others, in computer vision, see, 
e.g., Szeliski f|l99Clh . We first explain the notation we use and formally state our problem before 
describing how our work relates to previous research. 



1.1 Notations 

We will use boldface and lowercase to denote a vector, e.g., z, and use Zi to denote its i th 
component. We will use uppercase for matrices and random variables (including vectors of 

*This research was done while the author was at the University of Rochester. 
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random variables). For any nx n matrix M and subsets V,V of {1,2 ... ,n}, we will use 
7\/[V, V] to be the submatrix indexed by rows in V and columns in V. Further, M[i, :] and 
M[:, i] will denote respectively the i th row and the i th column of M. We will say M has support 
V x V' if all non-zero entries in M occur in M[V, V'}. 

For any k, Ik will denote the k x k identity matri x. We will denote the spa ce of n x n sym- 
metric positive semidefmite matrices (see Chapter 7 of Horn and Johnso"nl ( 19851) for a definition) 
by X™ xn , and the usual or dering on positive semidefinite matrices by ^ (see Definition 7.7.1 
of IHorn and Johnson The space of all matrices in X" xn with support V x V will be 

denoted XY xV . We will use Q nxn to d enote the class of al l n x n symmetric diagonally dominant 
matrices (see, e.g., Definition 6.1.9 of IHorn and Johnson! (|1985l) ) with non-positive off-diagonal 
entries. Such matrices include, e.g., graph Laplacians. 

Given M E X? Xn , we will use X m i n (M) and X max (M) to denote respectively the smallest 
and the largest non-zero eigenvalues of M, if they e xist. Further, if M i s of f ull rank then the 
condition number of M will mean (see Chapter 5.8 orn and Johnson! (|1985l) ) the ratio of the 
largest to the smallest eigenvalue of M. 

For random variables X = (Xi,X2, ■ . ■ , X n ) and for any set V C {1, 2, . . . , n}, we will let Xy 
be the coordinates of X indexed by V, i.e., a vector (of size |V|) of the random variables indexed 
by V. If p is the joint density function of variables X, we will use E p [- | Xg] and V p [- | Xg] 
to denote respectively the conditional expectation and the conditional variance given X5 (i.e., 
given variables in S are observed). We will drop the subscript p when the density is clear from 
the context. 

Some of our al gorithms will be fully polynomial time approximation schemes (FPTAS) — - 



sec 



Vaziranil (l200l[ ) for a definition. 



1.2 Definitions and Problem Statement 

A Gaussian MRF on a graph G = ({1,2,..., n}, E) is a Gaussian process X — (Ai, A2, . . . , X n ] 

with covariance matrix S (of full rank) such that E = {{i,j} | Ay 7^ 0}, where A = is the 
inverse covariance matrix, also called the precision matrix. This means that the joint density 
of X, jo(X), is Markov wi th respect to G, i.e., that p(X) factorizes over the cliques of G (see, 
e.g., Sudderth et al.l (|2004h ). The problem we study assumes the parameters (the means and the 
covariance matrix) of the Gaussian MRF are known, and hence we assume, w.l.o.g., that X is 
origin-centered. 



Th e following are som e wel l-known facts a bout Gaussian processes (see, e.g.. lRasmussen and Williams 
( 2006 ); Sudderth ( 20021) . also Krause et al. ( 2008 )). Given variables X5 to observe, for each X t 
the linear predictor w(S, i) Xs which minimizes the expected squared error, i.e., 



def 



arg mm 

wGRl s 



E[(A 4 -w*X s ) 2 ]. 



(1) 



is given by w(S,i) = S'] _1 I][S', i], and is an unbiased estimator of Xi, i.e., w(S, i)*Xs = 
E \Xi I X5I . The minimum expected squared error for linear prediction for a Gaussian process 
equals the conditional variance of Xi given X5, i.e., 



E[(A, — w(S, i)*Xs) 2 ] = E[i,i] - E[i,5]E[S',S']- 1 E[5,i] 



E 



[Xi-E[Xi|X s ]) 2 I Xc 



V[A, I X S ], 



(2) 
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and is independent of the actual observed values of X5. In light of @, the average expected 
squared error for predicting all the variables, 



i 

turns out to be 



(3) 



n 

i4S 



crr ( 5 ) = It. v t^ 1 m = ^E v t^ 1 x *l w 

and can also be expressed as 

err(S) = -Tr ( ~S\ - E[S, S] (ELS, S]) _:L £[S, 5] 

W " V / (5) 

Our error function err(S'), and hence V[JQ | X5], is clearly monotone decreasing in S*. In 
fact we can show that the expected conditional variance E [V \Xi | Xs] ] , for any set of random 
variables X\ , X2 , • ■ • , X n (not necessarily Gaussian) is monotone decreasing. 

Lemma 1. For any subsets S CT and for any X it E[V[X, | X s ]] > E[V[X, | X t ]]. 

Proof of Lemma [1] 

We will use the identity that for any sets S and T, 

E[V[Xi I X SUT ] I X s ] = V[Xi I X s ] - V[E[Xi \ X Su t] \ X s ] . (6) 

Now the lemma follows from the fact that S C T and taking expectation of both sides of (|6]) 
w.r.t. Xs- M 

We formally state the problem of finding a variable selection strategy below. 

Question 2. 7s there an algorithm which given a budget b finds a set S* of size \S*\ < b such 
that 

ci~r(S*) = min crrfS"). (7) 

v ' \S\<b v ' 

Question 3. The cover version of Question\^is whether an algorithm exists which, given a > 0, 
finds the smallest S* such that 

\S*\ =min {\S\ | err (5) < a}. (8) 

A Gaussian free field or GFF (see, e.g., Chapter 2.7 of iLvons and Peres! (|201lh ) is a special 
case of a Gaussian MRF on a connected graph G, where each edge {i, j} € E is associated with 
a finite weight r,j = rji > 0. We assume that for each {i,j} ^ E, rij = rji = +00. Fix a node, 
say 1, and assume that X\ = 0. Then the density of a GFF is given by 

p(X u X 2> ...,X n ) oc expf - {Xi 2r Xj)2 )- (9) 
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We set one variable, namely X\, to so that the density p in ([§]) is well defined. This 
corresponds to always selecting the variable Xi for observation. Consider the Laplacian A 

A[t,j] * ( £*i( 1 / r «) l i = j (10) 
[ —1/nj otherwise, 

Although a GFF really defines a distribution on X{2,. «} with precision A[{2, . . . ,n}, {2, . . . , n}], 
we will work with the matrix A so that we can treat X\ symmetrically with the other variables. 
Note that any Gaussian distribution whose precision matrix is strictly diagonally dominant with 
non-positive off-diagonal entries can be thought of as the marginal distribution (over n variables) 
in a GFF with n + 1 variables. 

1.3 Summary of Contributions 

We list the main results in this paper in the order they are presented: 

• Finding an exact solution to Question[2]is NP-hard even for GFFs (Section[2J Thcorcm[5|). 

• The average expected squared error function err is supermodular for GFFs, thereby giving 
greedy approximation algorithms for Questions [2] and [3] (Section [2j Theorems [9] and ITT]) . 

• There is a FPTAS for GFFs on bounded tree- width graphs based on message passing (i.e., 
dynamic programming). While it is not difficult to formulate a dynamic programming 
algorithm for Gaussian MRFs on trees (which are equivalent to a GFFs on trees), extending 
it to bounded tree-width graphs is non-trivial and requires a more intricate analysis of the 
error (Section 13.1.11 Theorem [29]) . 

• There is a similar FPTAS based on dynamic programming for general Gaussian MRFs 
on bounded tree-width graphs, whose running time however scales as a polynomial in the 
condition number of the input precision matrix (Section 13.21 Theorem I43|) . 



1.4 Related Work 



We first compare our error function (i.e., the average expected squared error) with those used 
earlier in the context of subset selection for Gaussian Proce sses. Prediction for Gaussian pro cesses 
is popularly known as kriging in spatial statistics (see, e.g., Rasmussen and Williamsl ( 20061 )). and 
the squared prediction error for an unobserved variable is referred to as its kriging variance. The 
problem we try to solve can be thought of as minimizing the average kriging varianc e (i.e., average 



expec ted squared error) over all the unobserved variables. A closely related work is lKrause et al 



(|2007j ). who consider the problem of minimizing the maximum kriging variance of an unobserved 
variable rather than the average. Also, there is extensive literature exploring subset selection for 
Gaussian processes using different c riteria like entropy a n d mutual information betwe en observed 
and unobserved variables, see, e.g.. iKrause et al.1 (|2008f ): iKrause and Guestrinl (|201lh . 

Subset selection problems similar to ours arise in a number of other areas. For instance, 
our problem is qui te similar to the widely studied problem of subset selection for regression in 
statistics, see, e.g.. iMiilerl (J2002) for an overview. Our objective differs from subset selection in 



regression in that we aim to minimize the tot al prediction erro r of all the u nobserved variables 
rather than that of a single variable. Recently Das and Kempei ( 2008 . 2011 ) have analyzed and 
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provided provable guarantees for several well-known greedy algorithms for subset selection, such 
as forward selection. 

Our goal of minimizing the average expected squared error of unobserved variables is also 
equivalent to minimizing the trace of the inverse of a principal submatrix of the inverse co- 
variance matrix. A similar tr ace minimization problem aris es in Baye si an an d transductivc 
experimental design, see, e.g., Chaloner and Verdinelli ( 1995 ); lYu et al.l ( 2006 ). For a linear 
mod el with a Gaussian prior over the unknown parameters, the Bayesian A-optimality crite- 
rion (|Chaloner and Verdinellil . 119951 ) reduces to minimizing the trace of the conditional covariance 
matrix of the parameters given the selected experiments (observations). However, our objective 
differs from A-optimality in that we want to minimize the prediction error of all the unobserved 
variables given the observed ones. In contrast, A-optimality requires minimizing the error of only 
a given subset of unobserved variables (i.e., the parameters being estimated), and moreover none 
of the parameter values can be observed. 

The budget version of our problem, Question^ can alternatively be formulated (using expres- 
sion (|S|) for the error) as finding a low rank approximation of the positive s emidefinite covariance 
matr ix S, or to be precise, a rank b Nystrom approximation (see, e.g., (jWilliams and Seeger . 
200lh ) of S which minimizes the trace norm error. Nystrom approximation has applications 



in Gaussian process regression, kernel machines and dimen sion reduction among others (see, 
e.g.. lSmola and Scholkopf <l2000h : IWilliams and Seegerl faOOlb ). However, to the best our knowl- 
edge, we are the first to focus on subset selection for bounded tree-width Gaussian MRFs (in- 
cluding GFFs) — we exploit the sparse structure of the precision matrices for such MRFs to give a 
dynamic programming based algorit hm. Subset selection strate gies for Nystrom approximation 
base d on greedy heuristics (see, e. g., Smola and Scholkopfl (|2000h ) and on random sampling (see, 



Drineas and Mahonev ( 2005 ): Kumar et al. ( 20091 )) have been studied before. In contrast 



e.g., 

the subset selection strategies we give in this paper are deterministic, have provable error bounds, 
and moreover have multiplicative rather than additive approximation guarantees for the error. 

We next discuss our motivation for considering bounded tree-width Gaussian MRFs and 
Gaussian free fields. Our motivation for studying bounded tree-width graphs comes from the 
fact that the exact solution of many pr oblems (such as fin d ing a maximum independent set and 
inference in graphical models, see, e.g., Bodlaender ( 1997 ); Roller and Friedman ( 2009( )). which 
are infeasible in general, become tractable for bounded tree- width graphs. There is extensive 
literature on inference algorith ms for graphical models, especially Gaussian MRFs — we refer to 
Chapter 2 of ISudderthl (|2002l ) for a survey. We point put that though inference for Gaussian 
MRFs can be performed in polynomial time (since it involves mainly a matrix inversion) for any 
graph, our problem of subset selection is much harder. In fact, as we show later, our problem 
is NP-hard even for the restricted case of Gaussian free fields. Also note that our problem is 
harder than, e.g., maximum independent set, in the sense that our problem does not fit into the 
framework of monadic second order logic on graphs (ICourcelld . Il990i: lBodlaendeiill997l) . 

A Gaussian free field (or GFF) is a special case of a Gaussian MRF, and can be thought 
of as "continuous analog" of a ferromagnetic Ising Model. A GFF corresponds to the inverse 
cova riance mat r ix bei n g a graph Laplacian, and is widely used in semi-supervised learni ng, (see , 
e.g., Zhu et al.l (2003 ); Belkin and Nivogi ( 2004 )) as well as in computer vision, e.g., ( Szeliski . 
19901) . In lZhu et al.l (I2003I a GFF is used to model the distribution of a discrete binary MRF, i.e., 
an Ising Model, given a set of observed labels. Specifically, the expected value of an unobserved 
node in the GFF is used as an approximation for the node's expected (binary) label value. A 
simple adaptive selection strategy is then given which queries the node with the most uncertain 
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label. Our aim here, unlike that of IZhu et al.l (|2003l ). is to (non-adaptively) select subsets for 
prediction in a GFF rather than use a GFF for adaptive selection in Ising Models. 



2 Gaussian Free Field 

We prove that finding an exact solution to Question [2] is not feasi ble even for GFFs . Note that 
the related problem of subset selection for regression is NP-hard (jNataraian . Il995l) and is also 
hard to approximate within a constant factor when the subset size is 0(logn) ( Das and Kempel . 



20081 ). However our infeasibility proof is for the special case of GFF, and hence does not follow 



from the hardness proofs for subset selection. We need the following characterization of our 
problem for regular graphs. 

Lemma 4. Let graph G be d-regular and let each edge {i,j} have weight rtj = 1. Then for any 
set S of nodes, err(S') > (l — \S\/n)/d where the equality holds iff S is an independent set in G. 

Proof of Lemma [4j 

Since G is d-regular and each edge has weight 1, each diagonal entry of A is d. Hence for any set 
S, 

n-\S\ 

Tr(A[S,S]) = (n-\S\)d = £ A 4 (A[S, S}), (11) 

i=i 

where A^A^, S}) are the eigenvalues of A[S, S] in any order. Using expression ([5]) for the error, 
we have by (fTTj) that 



crr(S) = — Tr (A[S, = - £ 



« V ' n ^ Xi (A[S, S] 



> I <g I^D 2 = (l-\ S \/n)/d, 



(12) 



where in the third step we used the fact that arithmetic mean is greater than harmonic mean, 
which are equal only if each of the elements (i.e., each eigenvalue of A[S, S]) arc equal, i.e., where 
equality holds iff 

Ax(A[S,S]) = A 2 (A[5,5]) = ••• = A„ HS |(A[S,S]) = d. (13) 

However (|13|) holds iff A[S, S] is a diagonal matrix, i.e., S is an independent set. The Lemma 
now follows. ■ 

Theorem 5. The budget version ([7]) of Question&is NP-hard. 

Proof of Theorem [5} 

The problem of finding, for any n, k, d and any d-rcgular graph on n nodes, wh ether an indepen- 
dent s et of size k exists, is known to be NP-complete, see, e.g., problem [GT20] in lGarev and Johnson 
In fact, finding independent sets is NP-complete even for boundcd-dcgrcc planar graphs. 



(> 



It follows from Lemma 2] that an independent set of size n — b exists iff for S* as defined in ([7]) 
and for budget 15*1 < b, we have err(S*) = (1 - b/n)/d. ■ 

We now give an approximation algorithm for Qucstion[5]usmg supcrmodularity of the average 
expected squared error. Si nce we directly consider the variance instead of variance reduction (i.e, 
the i? 2 -statistic, see, e.g., Das and Kenrpd (|2008h ^l. it is more convenient to use the notion of 
supcrmodularity rather than submodularity. 



Definition 6. A function / : 2* 1 ' 2 "-'™} 
returns" condition holds: 



is supermodular if the following "diminishing 



(yA)(Vx,y) f(A)-f(Au{x})>f(Au{y})-f(Au{x,y}). (14) 

Given oracle access to a non-negative supermodular function / : 2^ 1,2, "'' n ^ — ¥ R + such that 
/({l, 2, . . . , n}) = and (VA, B) A C B =S> /(A) > f(B) (i.e., / is monotone non-increasing), the 

def 

supermodular minimization problem is to compute = arg min f(S), while the supermod- 
ular cover problem is to compute S 1 * — arg min \S\. We refer to iNemhauser et all (jl978h : 



f(S)<a 



Wolsevl (|1982I ) for the well-known greedy algorithm for both constrained minimization and cover 



problems for supermodular functions. 

Our main result in this section is a proof that the error function err is supermodular, for 
which we need the following well-known connection between electrical networks and GFFs. 

Lemma 7. (see, e.g., Lemma 2.15 of Ding et al. (201 ft )) Consider the GFF given by any 
non-empty set S and any i (£ S. Then 



V[X> | X s ] =R^(i,Sf), 

where R e g(i,S) is the effective resistance between i and S 
interpreted as the resistance of edge between i,j. 



G with each rij (= Tji) being 



Lemma 8. For a GFF the error function err(S*), or equivalently the function Tr (A[S', S] x ) 
(using is non-increasing and supermodular in S. 

We defer the proof to the end of this section. The following example illustrates that Lemma[5] 
is not true for Gaussian processes in general. 

Example 1. For the multivariate Gaussian with covariance matrix 

0.4435 0.1092 -0.0905 -0.0527 

0.1092 0.3041 0.0256 0.0227 

-0.0905 0.0256 0.1273 -0.1444 

-0.0527 0.0227 -0.1444 0.3752 

crr({l}) = 0.1887, crr({F 2}) = 0.1162, crr({l, 3}) = 0.1009 and crr({l,2,3}) = 0.0263 which 
violates supcrmodularitjo. 



x all figures precise up to ±10 
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Submodularity has been used previously (see, e.g., ( Krause et alj , l2008h ) for a similar problem, 
namely sensor placement, where the objective (unlike ours) is usually to maxi mize the mutual 
i nform ation between observed and unobserved nodes. We also point out that in lDas and Kenxpe 
( 20081 ). it is shown that the absence of suppressor varia bles is a necessary and s ufficient condition 
for the error to be supermodular (see Theorem 8.1 of iDas and Kempd (|2008l) for details). Our 
proof of supermodularity docs not use their result — instead we use the analogy with electrical 
network s. Note that Lemma El implies that a GFF does not have any suppressor variables in the 
sense of Das and Kempd ( 2008 ). 

In IDas and Kempe "(120111) . a sufficient condition for supermodularity of V[Xj | Xg] for a 
Gaussian process (see Definition 2.3 and Lemma 2.4) is given in terms of eigenvalues of covariance 
matrix. However as Example [5] shows, their result does not apply to Lemma E] 

Example 2. Consider a GFF on the complete graph on 5 nodes with weights (Vz ^ j) =5/2 
and with X\ = 0. Then (Vi > 1) V[-X"j] = 1, but th e covariance mat r ix bet ween X2, X3, X4 
has a minimum eigenvalue 0.5. Hence Lemma 2.4 of IDas and Kempe does not imply 

supermodularity of V[X; | Xs] for this GFF. 

Give n Lemma El an an s wer t o the budget version follows immediately from the greedy algo- 
rithm of Nemhauser et "all (1978). 

Theorem 9. For the budget version (JT]) of Question^ there is a greedy algorithm which for any 
budget b, outputs a set Sb such that err(Sb) < rpjj-^ mm |s|<& err(S'). 

The proof for cover version (Question [3]), Theorem II 1[ requires Lemma [TU1 

Lemma 10. (see Theorem 1 of[ Wolseil UM)) There exists an algorithm which for any non- 
negative supermodular non- increasing function f and any a > computes set S a such that 
f(S a ) < a and \S a \ < (1 + k)\S* \, where 

/(0)-/(W) 



in 



max 

x,S 



f(S)> f(SU{x}) 



(15) 



J(S)-f(Su{x}) 

Theorem 11. There exists an algorithm which given any GFF and any a > 0, outputs a set S, 
such that err(5 a ) < a and 



where r = 



\S a \ < [l + lnUn-l 
id R = maxj !i3 ) ££ r^ . 



,R 



mm 

err(S)<a 



\s\, 



(16) 



Proof of Theorem lilt 

Since by LcmmaElthe function err is non-increasing and supermodular, wc can apply Lemma [TUl 
It only remains to be prove an appropriate upper bound on the r.h.s. of (|15|) . Wc will show that 
for any x, S s.t. x ^ S (and keeping in mind that we assume 1 € S), 



err({l})-err({l,x}) < (n - l) 2 R 
err(S) - err(S* U {x}) ~ r 

We will use the fact that by Lemma [7] and (j4|) , 



(17) 
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We can bound 

err(S) - err(S' U {x}) > err({l, 2, . . . , n} \ {x}) - err({l, 2, . . . ,n}) 

(since err is supermodular) 
= crr({l,2,...,n}\{x}) 
= R eS (x,{l,2,...,n}\{x}) > r/(n-l), 

since there can be, in the worst case, n — 1 edges of resistance r between x and the rest of the 
nodes. Also we have 

err({l})-err({l,a;}) < crr({l}) < max R cS (y, {1}) < (n-l)R. (19) 
From (|18[) and (fT9|) we get (fl7)) , which gives the desired upper bound on (fT5j) . ■ 



Hence the subset chosen by our algorithm is only 0(ln n) times the optimal. For proving 
Lemma [8] we need Lemmas [T2l and [T3l 

Lemma 12. (Thomson's Principle, see, e.g., Chapter 2.4 of lLvons and Peret 1 201 A )) For any 
unit flow f in G from a set of nodes S to any node t (ft S , define 



2 r 



The 



V[X t | X S ] =min £{f). 



Proof of Lemma |T2} 

From Thomson's Principle we have that R e s(t , S) = min/ £ (/) (see, e.g., Chapter 2.4 of lLvons and Peres 
(|201lh . also Lemma 2.11 of iDing et al] (|201l[ )). Our claim now follows from Lemma [7] ■ 

The next lemma states that given 2 unit flows in a network from 2 sources S and T to a 
common sink, one can always construct 2 other unit flows from sets S U T and 5 fl T to the sink. 

Lemma 13. Consider any set S and any distinct x,y,t S. Then for any 2 unit flows f\,fi 
in G respectively from S, S U {x, y} to t, there exist 2 corresponding unit flows g\, gi respectively 
from S U {x}, S U {y} to t such that for each i =/= j , 

9i(i,j) +92{i,j) = fi[i,j) + h{i,j) and, (20) 

mm{f 1 (i,j),f 2 (i,j)} < gi(i,j) < max{f 1 (i,j),f 2 (i,j)}, ^ 
min{h(hj)J2(i,j)} < 92^, j) < max{/i(i, j), f 2 (i, j)}- 

We defer the proof of Lemma [T3] to the end of this section. We now have all the ingredients 
for proving our main result. 
Proof of Lemma [8j 

Lemma [T] implies that crr(S') is non-increasing. 

We will prove that for any t, V[X t | X5] is supermodular, which by (|14[) . requires us to show 
that for any set S and any 2 nodes x, y (fc S, 

V[X t I X S ] +V[X t I Xsu {x ,y}] > V[X t I X SuM ] +V[X t \ X Su{w} ]. (22) 
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The supermodularity of err(S) follows from fl52J} since err(S) = \ £V I x s] -Hie SU{:c, y} 

then (|22)) follows easily from Lemma [TJ 

Now assume t ^ S U {x, y}. By Lemma [T51 for any 2 unit flows /i, /2 in G from S, S U {x, y} 
respectively to i, there exist unit flows gi, g 2 from SU{x}, SU{y} respectively to t satisfying (f2TJ)) 
and (J2I1)- It follows from (gHJ), (EH that for each i ^ j 



which implies 



£(fi) + £(/ 2 ) = \Y. (A 2 (M) + fi(ij))rij 

>ji;W'j) + ^-?)) r « (23) 

= f(. 91 ) + f(. 92 ). 

We can now combine Lemma [T^] with (|23l) to obtain (|22p . 
■ 

Finally, we finish off with the proof of our flow composition lemma, Lemma [TU1 
Proof of Lemma I13t 

To prove that <?i exists, we note that gi is a feasible solution to the linear program defined in (|24p . 
Existence of g 2 follows from that of g\ as (Vi ^ j) g 2 

maximize 0, subject to 
antisymmetry: ( Vi < j) gi (i, j) + gi (j, i) = Q, 

flow conservation: (Vi ^ S U {x, t}) gi(i,j) = 0, 

j 

unit flow: ^ 9i{j,t) = 1, 

value of / 2 at x: ^ gi(xj) = Q d = f2(x,j), 

j 



(24) 



capacity constraint: 



(Vi ^j) sri(i,j) < C!y d = max{/i(i,i),/ 2 (i,j)}. 

Note that capacity constraint, antisymmetry and ([2TJ]) together imply (f2~Tj) . Consider the dual 
of (|24p . with dual variables {d({i,j})}i^j (antisymmetry), {a(i)}i&su{x,t} (flow conservation), 
a(t) (unit flow), a(x) (value of f% at x) and {b(i, (capacity constraint). There exists of 

feasible solution to ([24]) iff the dual objective (|25|) is non-negative. 

minimize a(i) — Qa(x) + Cijb(i,j), subject to 

(V* ^ i) 6(i, j) + d({i, j}) + o(i) = 0, (25) 
(V» 6 S) o(i) = 0, 
(V^j) 6(i,j)>0. 
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Eliminating {d({i, j})}i^j from (|2"5)) yields 



minimize a(t) — Qa(x) + Cijb(i,j), subject to 

(Vi < i) 6(*,j) + o(i) = 6(3, i) + a(j), (26) 
(Vi e 5") o(i) = 0, 
(Vi^i) 6(*",J')>0. 

Further eliminating {6(i,j)}»/j from (f2"o| gives (|27p. 

minimize a(t) — Qa(x) + Cji(a(j) — a(i)) 

a(])>a(i) (27) 

subject to (Vi <G 5) o(i) = and (Vi) a(i) £ [—1,1], 

where bounding each a(i) to be in [—1, 1] does not affect the sign of the objective (|27|), In fact, by 
th e same argumen t as that in the proof of the max-fLow min-cut theorem (see, e.g.. Chapter 12.2 
of IVaziranj (EpOlh (JUJ) is equivalent to the following integer program 

minimize a(t) — Qa(x) + Cji(a(j) — a(i)) 

a(j)>a(i) (28) 
subject to (Vi G S) a{i) = and (Vi) a(i) £ {-1, 0, 1}. 

Applying the shift (Vi) s.t. a(i) ^ 0, a(i) h->- a(i) + z, where z G [—1,1] to (|28[) . gives us an 
objective which is linear in z and whose minimum is achieved at either z = — 1 or z= 1. Hence 
the minimum in (l28l) can not be less than that of (l29l). 



minimize a (i) — Qa (x) + Cji(a(j) — a(i)) 

a(j)>a(i) (29) 
subject to (Vi G S) a(i) = 0, and either (Vi) a(i) G {-2,0} or (Vi) a(i) G {0,2}. 

Further the objective (|29[) is non-negative iff (|30)) is non-negative where we have just scaled each 
a(i) by 1/2. 

minimize a(t) — Qa(x) + Cji(a(j) — a(i)) 

a(j)>a(i) (30) 
subject to (Vi G 5) a(i) = 0, and either (Vi) a(i) G {-1,0} or (Vi) a(i) G {0, 1}, 

We first assume that (Vi) a(i) G {—1,0} and show that the minimum in pop is non-negative. 
Any assignment (Vi) a(i) G { — 1,0} defines a cut in G between nodes {i | a(i) = 0} (including 
5) and nodes {i | a(i) = —1}. If a(t) = 0, 

a(t)-Qa(x)+ G oi{aip)-a{i)) > £ Cji(a(j) ~ «(*)) > £ ACm) = 0, 

a(j)>a(i) a.(j)>a(i) a(j)=0 

a(i)=-l 
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since (Vi 7^ j) Cji > fi(j,i) and fi is a flow from S 1 to t which are all on the same side of the 
cut. On the other hand, if a(t) = — 1, we have that 

a(t)-Qa(x)+ CjMj) - a(i)) > -1+ ^ = -1 + 1 = 0, 

a(j)>a(i) a(j)=0 

a(i)=-l 

since (Vi 7^ j) Cjj > /i(j, i) and /1 is an unit flow from S to t. 

Now assume (Vi) a(i) £ {0, 1}, which defines a cut in G between nodes {i | o(i) = 0} (including 
S) and nodes {i \ a(i) = 1}. If a(t) = a(x) = 

a{t)-Qa{x) + Y, Cji(a(j) - a(i)) > £ h{j,i) = 0, 

a(j)>a(i) a{j) = l 

a(i)=0 

since (Vi 7^ Cji > /i(j, i) and 5, x, i are all on the same side of the cut. If a(t) = 1, a(x) = 0, 
a(t)-Qa(x)+ G iM3) ~ <*(»)) > 1 + £ / 2 ^*) - 1 ~ 1 = °> 

a(j)>a{i) a(j) = l 

a(i)=0 

since (Vi 7^ j) Cji > /2C7, i) and /2 has a total flow value of —1 across any cut from t to 5 U {x}. 
If a(t) = a(x) = 1, 

a(i)-Qa(x)+ ^ Cji(a(i) - a(i)) > 1-Q+ ^ / 2 (j,i) > l-Q-(l-Q) = 0, 

a(j)>a(i) a(j)=l 

a(i)=0 

since (Vi 7^ j) Cji > /2O', i), /2 has a total flow value of —(1 — Q) across any cut between S and 
t, and the flow from x does not contribute to this cut, x being on the same side as t. Finally if 
a(t) = 0,a(x) = 1, 

a(t)-Qa(x)+ Y WaV) - a{t)) > -Q+ £ /a(j,0 = -Q + Q = 0, 

o(j')>o(i) a(j)=l 

a(i)=0 

since (Vi 7^ j) Cji > /2O, i) and / 2 has total flow Q from x to t, and 5* does not contribute being 
on same side of the cut as t. Hence ([50)1 is non-negative and the Lemma follows. ■ 

3 Gaussian MRF on Bounded Tree-width Graphs 

In this section we will give an approximately optimal algorithm, to be precise a FPTAS, for 
Gaussian MRFs on bounded tree- width graphs. Consider a Gaussian MRF X = (X-y, A 2 , ■ ■ . , X n } 
on graph G of tree- width at most k. The bound on the tree- width implies that A is sparse, 
having fewer than kti non-zero entries. Note that the precision matrix A is defined only for 
non-degenerate Gaussians (i.e., with covariance matrix of full rank), and hence we will assume A 
is of full rank. In this section, we present an approximation algorithm based on message passing 
for such a Gaussian MRF. We will show that for the special case of GFFs, the running time of 
this message passing algorithm scales as n°( K > in the number of variables. 
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We are going to use the notions of tree-decompositions and elimination orders associated with 



such decompositions, and refer to iBodlaenderl (|2007l ) for a survey. Let ( {V"i, V2, • ■ • , V m }, T = 

({1, 2, . . . , m}, F^j be any tree-decomposition (i.e., a junction tree) of G of width k' > k. As we 

show later, using shallow tree-decompositions of width greater than the smallest possible (i.e., 
k) may yield faster algorithms. Hence, each cluster Vi C {1, 2, . . . , n} has size at most k' + 1. 

Note 1. We will assume w.l.o.g. 

• that m > n (using a bigger width k! > k, might lead to more than n clusters), 

• that each non-leaf cluster in tree T has degree 3 , 

• that T has an empty cluster, V m = (say), which is a leaf, and 

• that l,2,...,?i is an elimination order for the given tree- decomposition. 

Such tree-decompositions exist — one can always tr ansform a given tre e-decomposition into a 
strictly binary tree (using, e.g., a trick like Figure 4.3 of lBodlaender first and then add the 

empty cluster V m as the 3 rd neighbour of the "root" (i.e. the cluster with exactly 2 neighbours). 
These assumptions about the tree-decomposition will help us give a clearer presentation of our 
algorithm. 

Now each edge {i, j} € F splits the tree T into 2 component subtrees Ty (containing cluster 
i) and Tji (containing cluster j) consisting respectively of the following nodes in G: 

Vy d = f |J V t and V 3l d = l |J Vi. 



Now define the sets , as 

Ay d = Vif)Vj =V ij nV ji and T tJ d = V t \ V, 



The set Ay separates nodes in G into Vy \ Ay and Vji \ Ay. This means the variance of 
any node in Vji \ Ay docs not depend on which nodes in Vy \ Ay are observed, as long as 
we know the conditional joint distribution of variables in Ay given these observations. For the 
case of Gaussian MRFs, the conditional distribution of variables in Ay happens to be specified 
fully by the joint precision matrix. Intuitively, one can think of the observations in Vy \ Ay 
as inducing a "prior" on the shared variables Xa^ ■ This Markov property allows us to use a 
dynamic programming algorithm. 

For our message passing scheme to work, we need to factorize the joint density function of X 
into a product of densities, one for each set Vi, as follows. 

Lemma 14. There exist precision matrices Ay 1 , Ay 2 , . . . , Ay m which give the factorization 

exp ( - ^X*Ax) = f[exp(- ^Ay.xY (31) 



j'=i 
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and which have the following properties. For each j, Ay. has support Vj x Vj (i.e., Ay. g^^^j, 
ftos ranA; | V} | , and 

- < A mi „(Ay ) .) < A mox (Ay 3 .) < X max (A). (32) 

Moreover, Ay , Ay 2 , . . . , Ay m can be computed from the given tree-decomposition in time 0(m«/ 2 ) 
Proof of Lemma I14t 

Si nce 1, 2, . . . , n are i n elim ination order, the Cholcsky decomposition (see, e.g., Chapter 2.6 
of Horn and Johnsonl ( 1985b) of A — A TO j„(A)7„ = {/*{/ has the following property. For each i 



there exists some cluster Vj i in the given tree-decomposition such that the support of the i th row 
of U, U[i, :], is included in Vj t , i.e., {/ | U[i, 1] ^ 0} C Vj t . This means 



X 



Ji j X 1 I ^ L ' J / J — 3i 

'(a-A^A^X = £x*l/[i, :]*£/[*, :]X = £ :]'£/[*, :])x 

J=x (33) 

= £ x%x, 

where for each j, A'y. = f 53rj»=i •] e X^ 3 ^ 3 . Now notice that /„ can be "split" 

into diagonal matrices, I n = Py + Py 2 + • • • + X>y m , such that in each 2?y. , each entry in the 
principal diagonal indexed by Vj is at least 1/m and the rest of the entries are 0. Hence the 
matrices 

(Vj) Ay. d = A' Vj + X min V Vj (34) 

satisfy the lower bound on eigenvalue in (f3"2")l . i.e., A "" r ^ A - ) < A TO i n (Ay 3 ). The desired factor- 
ization pip as well as the upper bounds on eigenvalues in (|32[) . i.e., \ m in(Av) < X max (A), now 
follow from combining (|33|) and (|34|) . 

As for the time complexity, note that since 1,2, ... ,n arc in elimination order, each step in the 
Cholcsky decomposition algorithm takes only 0{k! 2 ) time using sparse matrix representations 
and hence the total running time is 0{mK 12 ). ■ 

We will also make use of two transformations for precision matrices, Obs and Marginal, which 
correspond respectively to observing some variables and computing the marginal over a subset 
of variables in each cluster. 

Given precision A' G XY xV and a set of observed variables O C V, function Obs transforms 
A' into a marginal precision matrix for variables Xy\o by setting the rows and columns of A' 
indexed by O to 0. In other words, if A" = Obs(A', O), then A" has support (V\ O) x (V\ O) 
and 

A"[V\0, V\0] d = A'[V\0, V\0]. (35) 
Transformation Marginaly A (A') computes the precision matrix of marginal distribution of 
variables in A. Transformation Marginaly A (A'), where A C V, is defined only if A[V \ 
A , V \ A] is of full rank. Margi nal^ A (A') has support A x A and (see, e.g., Chapter A. 2 
of Rasmussen and Williams! (2000)) 



Marginal^ (A') [A, A] = A'[A,A] - A' [A, V \ A] (A'[V \ A, V \ A]) 1 A'[V\A,A]. (36) 
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In other words, Marginal^ A (A') can be thought of as a sequence of 3 operations: first invert 
A'[V, V], then take a principal submatrix (indexed by V \ A) of the inverse, and finally invert 
the resulting submatrix. Intuitively, Marginal yA corresponds to "integrating out" the other 
variables Xyu, and the following lemma makes this intuition precise. 

Lemma 15. For any A' <G X^ xV and any A C V , if Marginal^ A (A') is defined, then 
f exp ( - ix'A'x) U dX t = C(A', A) exp ( - ^X* Marginal^ (A') X 

where C(A', A) does not depend on Xy\ A . 

The complexity of computing Marginal and Obs are 0(|y| 3 ) and 0(|y| 2 ) respectively when 
the input matrix has support V xV. We note that application of Marginal or Obs does not make 
the smallest non-zero eigenvalue of the input matrix any smaller. This property will be useful 
later (Section 13. 2p when we discuss how to perform approximate message passing. 

Lemma 16. For any A' S X^ xV of rank \ V\, any O C V and any A CV, 

A mm (Obs(A', 0))>A mm (A') and A mm (Marginal^ (A')) > A min (A') 
Proof of Lemma I16t 

That A m i„(Obs(A', O)) > A TO , n (A') follows from the fact that taking a principal submatrix of 
a positive definite matrix (more generally, of any symmetric matrix), in this case A'[V, V], does 
not decrease the smallest eigenvalue. Similarly, Marginal^ A (A') consists first inverting A'[V, V], 
then taking a principal submatrix indexed by V\ A, and finally inverting the resulting submatrix. 
The largest eigenvalue of the inverse, (A'[V, V]) -1 , is at most A TO , n ((A'[VJ V])) , which can only 
decrease after taking the submatrix in the second step. Hence after the final inversion in the 
third step, the smallest eigenvalue is at least X m i n (A'[V, V]), which equals A ro , n (A') since A' is 
of rank |V| and has support V x V. ■ 

For any set A C Vi and a set of observations S C Vij, we will say that variables X A have 
precision P given S if P is the precision matrix of the "prior" induced by observations S on 
variables X A , considering only those factors in Lemma [HI which belong to the subtree T t j in the 
given tree-decomposition. 

Definition 17. Consider any edge {i, j} € F in the tree-decomposition. Then for any set A C Vi 

and any set of observations O C Vij , we say X A has precision P given O in Tjj if 

P = Marginaly.^Q A \ Obs J2 

Note that considering only factors which belong to subtree defines a different set of 
random variables with a different density than the original set, i.e., X = (X\, X 2 , ■ ■ ■ , AT„) ; 
however, we will slightly abuse the notation and still use X, and the actual distribution of these 
random variables should be clear from the context. We also need to introduce the following 
notation which describes the total error achieved in a subtree of the given tree-decomposition 
given observations in that subtree. 
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Definition 18. For each edge {i,j} G F, for any precision matrix Q 6 x^'j xA *j consider the 
(origin-centred) Gaussian density /0y,Q on variables Xy 4 . in subtree Ty defined by the precision 
matrix Q + J2ier- ■ Then for any set O C Vy of observed variables, 

Ra(Q,o) = £ v Py , Q [x, |x ]. (37) 

In other words, i?y (Q, O) is the total error of variables in Vy \ Ay when Xa^. has Q as a "prior" 
due to observations which lie outside Vy. 

We are now going to describe an idealized message passing algorithm which finds the exact 
optimum of the budget version, but uses messages that are functions on continuous domains. 
Later we will show how to round the messages (making their size polynomial) at the cost of 
producing an approximate solution. 

Intuitively, the message is a function that gives the optimal total error in one part of the graph 
(Vij) for every possible way of splitting the budget between the parts (Vij \ Ay, Vji \ Ay, Ay), 
for every possible choice of observations in Ay (respecting the budget allocation), and for every 
possible pair of distributions of the shared variables (Ay ) where the first distribution comes from 
the Gaussian MRF on Vij (that is, using only the factors in (f3"Tj) that are in Vij) and the second 
distribution comes from the Gaussian MRF on Vji (again, using only the factors in (piTj) that are 
in Vji \ Ay ) . Note that wc allow the allotted number of observations for first (and also for the 
second) distribution but their location is not communicated in the message (this is justified by 
the Markov property discussed earlier). 

The message passing algorithm proceeds in a sequence of rounds. In each round, cluster i in 
T optionally sends a message aj_y to its neighbouring cluster j along edge {i,j} € F. In the 
first round, each leaf in T sends a message to its (only) neighbour. Once i has received a message 
from each of its neighbours excluding j, i sends a message to j exactly once in the following 
round. The message ai-y is a function 



^A i3 XA„ x ^A i3 XAy ^ ^ x {(), 1,2, . . . , b} 



interpreted as follows. Given Q,P e ^ A « xA «, S C Ay and N < b, let S*(P,S,N) be the 
collection of all sets S' C Vij , each of which have the property 

• that S' (~1 Ay = S (the variables in S are the only ones to be observed among Ay ) , 

• that \S'\ < N (at most N observations are allowed in Vij), and 

• that Xi ;j has precision P given (only) observations S' in subtree Ty. 
If S* (P, S, N) is non-empty, then ai_y (P, Q, S, N) is defined as 

a t ^(P,Q,S,N) = min V V[l t |X sw ], (38) 

s'&s*(p,s,n), s"gv}AA y tp J-? A .. 

Xa has precision *i 
Q given S" in Tj t 

Otherwise, a 2 -y (P, Q, S, N) = +00. We point out that by ([37]). one can also express a;-y as 
a t ^(P,Q,S,N) = min R tJ (Q,S'). 
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It will be useful later to have a notion of the height of a message, defined as one plus the 
height of the maximum of the heights of messages from which it was composed, with the messages 
sent from leaves having a height of 1. 

Given the definition of <x;_>j above, its value when cluster Vj happens to be a leaf (i.e., when 
(Xi^j has height 1) for arguments Pij,Qji £ X Aijy - Aij , Sij C A^- and Ni can be expressed as 



^i— >j > Qji ! Sij , Ni) — 



min Tr((A' Vi [W,W]) with 

V J (39) 



|£y|+|Sy|<JV, 

V( = Tij \ Lij and M v = Obs(A Vi + Q Jit S tJ U L tj ) 



where the set in (|39p must satisfy the additional constraint that Xa ( . has precision given 
observations Sij U in . 

Next we describe how the messages are composed for internal nodes. Suppose cluster i has 3 
neighbours j, k and I, and consider the earliest round by which i has received messages and 
a;_>i from clusters k and I respectively. Then the function aj_> 3 - for arguments Pij , Qji , , Ni 
is composed in the following round recursively using (j40l) - ([43l as follows. 



(40) 



Oi^.j(Pij,Qji,Sij,Ni) = min a k ^i(Pki,Qik,Sik,N k ) +ai^,i(Pu,Qu,Sii,Ni) 

Pki,Qik,Sik,N k 



with 



+ Tr((AV i [K,^])- 1 
V( = \ L^, and A' v = Obs^Ay + P H + P u + (.)„. S tJ U L y J , 



and where the minimum (infimum) in (|40[) is taken over all Ljj C and over all Pki, Qik, Sik, N^ 
and P H ,Qii,Su,Ni satisfying gl]), g!]) and (J43J) below: 



5ift = (Sn U L i3 -) n A lk , fia = (Sy U L tf ) n A, ; , + |i <3 -| + JV* + JVj - |S ifc | - < A^, 

(41) 



P y = MarginaV A(g .. ui .. )) A ..\ S .. ( Obs( A Vi + P kl + P /l; 5« U L <3 - ) ) , (42) 
and further. 



Q ik = Marginal yA(S .. uz ,.. )! Aik \ Sik (obs^A 
Qu = Marginal yA(SijUiij)i AiI \ Si , ^Obs^Ay, + P ki + Qji, Sij U LyJ J 



(43) 



We will use P^, Q* fe , 5* fe , AT£, P ; * , Q* ; , and iV* to denote the values of P k i,Qik,Sik,N k , 
Pii,Qii,Su,Ni respectively for which the minimum in (|40|) is achieved. 

Once we have the ideal message passing algorithm, it is easy to describe how the approximate 
messages are composed. Assume that there is a transformation Round e which, for any e > 0, 
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e-approximates any precision matrix with support V x V by mapping it to an element of an £-net, 
■jVxv ^ f or g^]^ matrices. We defer the precise definitions of our notion of e-approximation, the 
transformation Round £ and the e-nets iy xV until later. The approximate messages will be only 
defined for precision matrices in xY xV where V C Ay, and we will only require (|4l?)) and (|4"5)l to 
hold approximately. To be precise given arguments Pij, Qji-, Sij, N%, we have 

Oi^j{Pij,Qji,Sij,Ni) = min a.k^i(Pki,Qik,Sik,Nk) +ai^i(Pii,Qu,Sii,Ni) 

Pki,Qik,S ik ,N k 

+Tr^k' Vi [v:,v:])- 1 ' 

with 

V( = T i:j \ L i:j , and k' v . = Obsf A Vt + Pu + P kt + Q ju § tj U L i3 , 



and where the minimum (infimum) in (|44p is taken over all Lij C Tij and all P k i,Qik, Sik, N k 

and Pu,Qu,Su,Ni such that P k i 
satisfy (gS]), gB]) and (g7J) below: 



and PiuQiuSiuNt such that P fci ,Q ife e 2**\** xA ^ 5 *, A,, Qa € I^ iA5 " xA,A5 ", and which 



S ifc = U L y ) n A tk , S tl = (S^ U Lij) n A«, |,%| + \Lij\ + iVfc+iV; - \S ik \ - \S a \ < N u 

(45) 

Pij = Round £ rMaxginal VA( § yUiy)i A . . ^Obs(A Vi + P fei + Pi, S l} U LyU ^ , (46) 



and 

Q ik = Round s (M^ginaViVOS uz, ) a - fc \s fc ( 0bs ( A v; + -Ai + Qj»> SyULy 
Qu = Round e ( Marginal K ^ u ^ _ Ali \Sii ( 0bs ( Ay > + Pki + ^J' 1 ' ^« U Li i 



We will use P ki ,Q* k , S* k ,N k , P*i,Q*i, S^,N* and L*. to denote arguments for which minimum 
is achieved (|4~4"1) . 

Finally, an approximate message sent from a leaf cluster Vi is given by the same equa- 
tion (|39[) as the ideal message, except that the joint precision matrix of Xa,, is rounded to 

r (Ay\Sy) X (Ay\S«) ; „ 

_£.£ 7 I.e., 

ai^j(Pij,Qji,Sij,Ni)= .mm f (A^ft',^']) -1 ), with 

l&il + |£yl<#< ( 48 ) 

y/ = r y \ L <3 -, and Ay. = Obs(A Vi + Q jh Sij U L y ) , 
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where the set Ly in (pf5|) must satisfy the additional constraint that the precision P of Xa 4J 
given observations Sij U Lij in tree T^- is such that = Round e (P). 

Now that we have described how the approximate messages are composed, we describe how 
to construct the e-nets I^xv anc j transformation Round e in the following subsections. 

3.1 Approximate Message Passing for GFFs 

For the special case of a GFF ^ (i.e., when the precision matrix A is a graph Laplacian (fTU)) ) on 
bounded tree- width graphs, we can use a simple notion of approximation which rounds off each 
element in the precision matrices being passed. Before we analyze this approximation scheme, 
observe that given our tree-decomposition, one can split A as A = ^\ Ay simply by defining 
Ay for each i to be the Laplacian of a subgraph induced by vertices in cluster Vi, with the 
understanding that each edge shared by 2 or more clusters is assigned to the subgraph induced 
by exactly one of these clusters. Unlike LcmmafT^l this split docs not use Cholcsky decomposition. 

Observation 19. For a GFF, there exist precision matrices Ay x , Ay, , . . . , Ay m E gnxn suc /j 
that each Ay. has support Vj X Vj , and which give the factorization 

cxp^-ix'Ax) = f]_exp(-ix*Ay 3 .x). 

The matrices Ay, Ay 2 , . . . , Ay m can be computed in time 0(mn) using sparse representations. 

In fact for GFFs all the precision matrices obtained during message passing are going to be 
symmetric diagonally dominant with non-positive off-diagonal entries, i.e., from g nxn . 

Definition 20. Given 2 matrices Q, Q' <E g nxn anc j an y e > 0, we will say that Q ~ e Q' if 

(Vi^i) c^\Q[i,j}\ < \Q'[i,j}\ < e s \Q[i,j}\, and 
(Vi) e- £ ^QM < ^Q'[i,i] < c £ ]TQM. 

3 3 3 

Note that the definition requires that the respective row sums, which are non-negative since 
matrices in g nxn are diagonally dominant, rather than the diagonal elements, be approximately 
equal. It is easy to verify that ~ satisfies the following "triangle inequality" . 

Observation 21. If Q,Qi,Q2 € g nxn and £i,£2 > are such that Qi ~ £l Q and Q2 ~ £2 Q, 

then Qi ~ El + e . 2 Qi- 

We approximate the precision matrices obtained while message passing by rounding the off- 
diagonal elements and the row sums, for which we need to calculate the range of values these 
elements can assume. Each non-zero element (or row sum) of each precision matrix obtained 
while running the ideal message passing algorithm (pfO"| - P5)) for a GFF lies in the range [q,c/,], 
with ci,Ch as defined in P5]) below. Intuitively, Ch is the largest possible value of the effective 
conductance (i.e., inverse of the effective resistance) between any 2 nodes in the electrical net- 
work associated with the GFF. Similarly q is, roughly speaking, the smallest possible effective 
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conductance between 2 nodes. Our e-nets for GFFs are going to be the following subsets of g nxn . 
lY xV d = |p E G nxn I P has support V x V and (Vi 7^ j) |P[i,j]l € C e and 

(Vi) (l>M) e£ *}' 



where 



dcf 



|o| (J |q, e e Q, e 26 Q, . . . , e L ( = X ''^cjj, and where 
min A[i,j]>o |A[i,j]| 2 _ min^j Tjj 



(49) 



Ch 



nmaxi^tj \A[i,j]\ in 
nmaxi^j \A[i,j]\ n 



2 2 mm i7 tj r„ 

Observation 22. For any e > and ,se£ V, the size ofXY xV is bounded as 



rVxVl 



2 + - In 



\ / »i2 may, ■ 1 T7i r 2 . \ \ 1^1 



r 2 . 

1 J 



We are now ready to define our transformation Round £ . Once again, note that we round-off 
the row sums instead of the diagonal elements. We point out that our definition of Round e is 
such that each off-diagonal element and each row sum of each precision matrices obtained during 
approximate message passing always stay within the range [q, Ch] (with q, Ch as defined in (|49p). 
For any e > and for any P € Q nxn ) P' = Round 6 (P) is given by 

( Vi 7^ j) P'[i,j] d = — arg min \r — \P[i,j]\\, and 

r€C E 

(Vi) P'[i,i] = f arg min |r — P[i, j] | + \P'[i, j] \ ■ 

J 3^F % 

Round e has the following property. 

Lemma 23. Consider any e > and any precision matrix P € Q nxn with support (say) V x V 
obtained while running the approximate message passing algorithm (given by (I44[) - (I47I) and (|48[) ) 
for the given GFF. Then Round e (P) eI E w and P w e Round e (P). 

Proof : 

The proof of P ~ e Round e (P) is straightforward given the definition (f2T)]) of ~ e and the definition 
of the e-nctI £ yxV '. ■ 

We next state and prove some useful properties of ~ e which will help us analyze how error 
introduced by rounding accumulates during message passing. Recall that our message passing 
algorithm (given by (f4"l | -(f4"T |) ) makes use of the following operations and transformations on 
matrices: addition, Obs (which is equivalent to taking a principal submatrix), Marginal, Round e 
and finally, the trace of the inverse. Lemma [23] tells us how much error Round e introduces. We 
analyze how each of the other transformations affect the rounding error. 
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Lemma 24. Consider any e > 0. Then for any set V and any Q,Q' € g nxn w ah support V x V 
such that Q' ~ £ Q, we have 

(VOCV) Obs(Q', O) ~ e Obs(Q, O), and (51) 

(VAC7) Marginal^ (<2') « E / Marginal^ (Q), w/iere e' = 3 inA| e. (52) 
Further, for any Q\, Q 2 and Q[, Q' 2 in Q nxn such that Q\ ~ £ Qi and Q' 2 ~ e Qi, we have 

Q[ + Q' 2 Se Qi + Q 2 - (53) 

Lemma 25. If Q,Q' £ <7' IX ™ are smc/i i/mf Q,(5' have support V x V, Q' ~ E Q, and Q[V,V] 
has full rank (i.e., \V\), then 

e" £ Tt(Q\V ) V]- 1 ) < Tr (Q'[V, V]' 1 ) < c e Tr (Q[V, V]^ 1 ). 



Proofs of ((51) and ([53]) are straightforward; we prove Lemma 1251 and (|52|) below. 
Proof of 1(52)1: 

We are going to prove the case where | V\ A| = 1, i.e., when exactly 1 variable is being "integrated 
out"; the general case follows readily by induction on \V \ A|. Assume that Xk, k G V is being 
eliminated so that A — V \ {k}, and let P = Marginal^ A (Q) , P' = Marginal VA (Q'). By 
integrating the joint density of Xy, or to be precise the function exp ( — iX*QXJ, w.r.t. we 
get the following identity. 



(54) 



(Vi^j) \P[i,j\\ = \Q[i,j]\ + Q\kk] ' 

^ ^ \Q[i,k] E,-Q[M 

Similarly, we have by integrating exp ( — ^X'Q'X) w.r.t. X^ that 

|i*Ml - iwjii + m '^ M , -d 

^ _ \Q'[i,k]\T, 4 Q'[k,j] ( 55 ) 

Taking ratios of the l.h.s. and r.h.s. of ((54]) and ((55)) and by the hypothesis that Q' ~ e Q, we 
obtain 

(Vi^j) e" 3£ |P[i,i]| < \P'[i,j]\ < e 3e |P[i,j]|, and 
(Vi) e- 3£ ^P[i,i] < ^P'M < e 3e ^P[i,i], 

which means P ~3 £ P'. ■ 



Proof of Lemma [ 

We are going to use the connection between electrical networks and GFFs, specifically Thomson's 
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Principle, i.e., Lemma [T2l Note that any Q, Q' £ gnxn can ^ th 0U ght of as n x n principal 
submatrices of (n + 1) X (n + 1) graph Laplacians, respectively P and P', whose (n + \) th rows 
(and columns) are defined as 

(Vi<n) P[n + l,i] = -^Qb'^]> ^[n + l,»] = -X;QUi]. 

i 3 

and P[n+l,n + l] = ^2\P[n+l,j}\, P'[n + 1, n + 1] = ^ \P'[n + 

j<n j<n 

Intuitively, one can think of P,P' as GFFs with an additional variable X n+1 , in which case 
Tr (Q[V, Tr (Q'[V, V] -1 ) are simply the sum of conditional variances of each X[, I € V, 

given X n+ \. However, by Thomson's principle we can express the conditional variance of Xi 
given X n+ i as the minimum energy of a unit flow from n + 1 to I. To be precise, we have by 
Lemma [12] that 

Tr (Q^yp 1 ) = E«f (56) 
lev tl & l^'JJI 

where each /; is constrained to be a unit flow from node n + 1 to node Similarly, 

lev h i^j 1 [l,Jil 

where each / ; ' is is constrained to be a unit flow from node n + 1 to node I. Now note that since 
Q Q' i we have by definition of P, P' that 

(\fi^j) e- E \P[i,j}\ < \P'[i,j]\ < e E \P[i,j}\. (58) 

If for each / /* denotes the optimal flow in ([55)) . then 

Tr{Q'[ Vl Vr) < £ ESI ( b y®) 



^ ^ E Et^H" = c £ Tr(Q[y,y]~ 1 ) (by (ESI and (ESI)). 



The proof of C - £ Tr (Q[V, V]- 1 ) < Tr (Q'[V, V}- 1 ) is identical. ■ 

Finally, we analyze the error of our approximation scheme. We first prove that the approx- 
imate messages are close to the ideal messages (Lemma [51]), and then show how to extract an 
approximately optimal solution from the approximate messages (Lemma I27|) . 

Lemma 26. Consider any e > and any message Qfi_>j of height h. For any Pij, Qji, Sij , Ni 
for which {Piji Qji-: Sij, Ni) is finite and for any Qji —°^2n'h' e Qji there exists Pij —r^2n'h e Pij 
such that 



(Pij , Qji , Sij , Ni) < exp (3 2K ' n»x{fc,h'}+2«'fc e ) a . (P . . i q .. } S .. jN .y (59) 
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Proof of Lemma [ 

The proof is by induction on the height h of the messages a^j, di_>j. The base case, h = 1, 
corresponding to messages sent from leaves, is easy to verify by comparing (|39[) to (|4"5|) . Let 
the minimum for the ideal message (|39[) be achieved for L.^ = L*j. Then consider the objective 

for the approximate message (j3SJ) with Lij ~ i*-, Py = Round e (Pjj), S,j = Sij, iVj = iV<, and 

V7 = \ L|j = V/, where V? is as defined in ((221) • Note that the objective in each of (|55|> . (gg|) 
is the trace of the inverse of a submatrix obtained by applying the transformation Obs to sum of 
and Qji,Qji respectively. Since Qji E2 3 2„'h' e Qji (by hypothesis), we have by fact (|53|) (for 
the sum) and fact (f5Tj) (for Obs and submatrix) that 

Obs(Ay, + Qji, U L*j) ~ 3 2,' h ' e Obs(A V! + Q jU Sij U 

Therefore, 

a -(P.. Q S- N-) ^(WiM' 



where in the second step we used Lcmma [25l to account for taking the trace of inverse of precision 
matrices. Note that since Pij = Round e (Pjj), we trivially have Pij 53 3 2«' e Pij. 

Hence it remains to verify the induction step for messages of height h > 1 . Let P ki ,Q* k , S* k , N k , P ; * , Q* t , S* t , Nj* , L*j 
be the optimal choice in (|40|) for the given arguments Pij, Qji, and iVj. 

Now consider @D with = S a = S*„ N k = N* k , N t = N* and Uj = L*j. Let P kl ,P u 
be such that 

Pki « 3 w(h-i) e f& and P K « 3a „/ (h _i )e P ( *. (60) 
Further, let Qik,Qu, and pj be given by (|47|) and (|46|) respectively. 

Now Qjfc is obtained (see (HZ])) by successively applying transformations Obs, MarginaL^^ .yj^ .) A . h \§ (k 
and Round e to the sum of Ay ; , Qji and Pu. Hence by facts (|5Tj) (for Obs), (j52"j) (for Marginaly.w^. .-, \. k \s k ), (E3J) 
(for the sum), Observation [2~T1 (for Round £ ) and ((60|) . we have that 

Qifc « e ' Q*fc where e' < e + 3 K ' max {3 2k ' e , 3 2K ' h 'e} < 3 2k ' max{ ' 1 ' (61) 

Similarly we have Qiz ~ £ ' Q*;, and that (see (|46p) 

Pj «„« P*j where e" < e + 3 K ' 3 2K ' {h - 1} e < 3 2K ' h e. (62) 

To complete the induction step we compare the corresponding terms on the r.h.s. of equa- 
tions (|44p and (|4"0)l. which define how messages a^j and a^j are composed respectively from 
dfc-H, and «*,_>», a/_y 4 . 

First, by induction hypothesis there exists Pt; which satisfies ([60]). i.e., P ki ~ 32/t 'ch-i) e P^, 
and for which the messages dfc_>i and ctfc-n (which are of height h — 1) satisfy 

&fc-n (Pfci ; Q^fc i ffjfc , A^fc) | o2«' max{max{h, (h'+l)}, (/i-l)}+2«'(ft-l) \ 

^(P fe VQ* fc ,S* fc ,iV*) - 6XI V V (63) 

< exp (3 2fi ' max f' 1 ' 'i'}+2k'/i £ ^ 
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where we used the fact (|6"Tj) that Qik ~ E > Q* k . 

Second, there exists Pu which satisfies (|60[) . i.e., Pu ~ 3 2«'(?i-i) E Pu, and for which similarly 

Otl^i(Pli,Qil,Sil,Ni) /o2k' max{/i, h'}+2n'h _\ ( RA \ 

a^P^S^N?) ~ 6XP ^ ^ (64) 

where we used the induction hypothesis and the fact that Qa ~ e / 

Finally, it only remains to compare the trace terms in (|44[) and (|40[) . Since by definition 

Pfei ~ 3 2 K '(h-i) e Pfcj and Pji « 3 2 K '(h-i) e Pj*, and since by hypothesis Qjj ~ 3 2 K 'h' e Qji, we have by 
facts (|5TJ and j53]) that 

A Vi[^')^'] ^majc{33»'Ch-i), 3=-'"'}e A V, [V- ,V-] (65) 

(where note that V( = Vi since = and Py = L*j ) . By Lemma [25] and (|65]) , the ratio of 
the trace terms is bounded as 



< exp(max{3 2K ' (/l - 1) , 3 2K ' h '}e). (66) 



Tr^(A' Vi [V!,V!}y 
Tv({K' Vi [V(,V!]y 



Combining (|6"3"]l . (|B~4")) and (|6"6"]) we obtain (f5T)]) . Since we have already established in (|B"2"]) that 
Py ~ 3 2 K 'h e P*j, this completes the proof. ■ 

Lemma[25]establishes that the optimal value of approximate messages is not much larger than 
the optimal value of ideal messages (which corresponds to the optimal solution). The converse 
of Lemma [26] is stated in terms of the total error in a subtree given a set of observations in that 
subtree, which we defined in (l37l). 



Lemma 27. Let a^j be a message of height h, and consider any Pij,Qji, Sij, iVj for which 
cti^j(Pij,Qij,Sij,Ni) is finite. Then the set of observations S C Vij, obtained by solving the 
approximate dynamic program (|44p - (|47[) on subtree Tij, is such that SnAy = Sij, \S\ < Ni, and 
for any Q 3i « 32K 'h< e Qji, 

Rii(Qji,S) < exp(3 2K ' m ^ h '^+ 2K ' h s)a i ^j(P ij ,Qj i ,S i j,N i ), (67) 

and such that the distribution on Xa^ given observations in S has precision Pij in Tij, where 

Pij ~ 3 2 K 'h £ Pij ■ 

Proof of Lemma I27t 

The proof is by induction on the height h of a^j . 

We first prove the base case of h = 1, i.e., when cluster Vi is a leaf. Consider the optimal 
solution, say L*j , to the approximate program for leaves given by (gSJl , and define V- = Tij \ L*j . 

Note that the optimal solution S for subtree in this case is simply given by S = Sij U L*j . 
Now, when cluster Vi is a leaf the error function R^, as defined in ([37]) . happens to be 



R i j(Q ji ,S)=Tr((A' Vi [V;,Vn) with 
= Obs(A Vi + Qj U Sij U L*j). 
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We prove the case h = 1 by taking the ratio of the r.h.s. of to that of (US]). First, notice 
that the matrices A' v . in ([68]) and Ay. in (|48|) are obtained by applying the transformation Obs 
to the sum of Ay t and respectively Qji and Qji. Since Qji ~ 3 2 K 'h' e Qji by hypothesis, we obtain 
by applying ([53"]) (for the sum) and (|5T|) (for Obs) that 

A> Vi « 3 wv e A'y. (69) 

Hence by Lemma [231 the ratio of the r.h.s. of to the r.h.s. of (|48p is at most 



Tr((A' Vi [Vf,V/}) 1 



< exp(3 2K ' h 'e), (70) 



Tr[(A' Vi [V?,V/]y 



which is as desired. Notice that the constraint in (|48|). that the true precision Py of Xa« given 
observations <Ssj U satisfy Py = Round £ (Pjj), implies that Py ~ 3 2 K ' e Pij is trivially true. 
This completes the proof for h = 1 . 

Next consider a message a^j of height h > 1. Let the optimum in (|44[) for arguments 
Pij , Qji , Sij ,Nibe Pfa , Q* fc , S* k , A>* , Pfc , Q* , S**, , iVf , £? . . The solution S consists of 

5 = SiUS k U L*j. U Sjj where 
S n Ay = , 5 n r 4 j = L* v] , and 
S; C Vu, Si n A l( = S* h and 

s fe c v ki , s k n A lfc = s* k . 

Define = S* J ,S lk = S* k ,N k = N%,Su = S* t ,Ni = N?,L tj = L* vj . Let P ki be the precision 
matrix of the distribution of XA lt given observations Sk in subtree T k i, and let Pu be the precision 
matrix of the distribution of Xa 4! given observations Si in subtree Tu. Further, let Pij, Qi k and 
Qu be as defined in respectively (|42|) and (j43|) . 

First, notice that by the induction hypothesis for height ft, — 1, the matrices P^i, P;, as defined 
above must satisfy 

Pki ~ 3 2 K < (fe -i) e P kl , and (71) 



Pi ~ 3 2K'(h-i) e Pu- (72) 

Now compare the definition of Pij in (|4"2"]l to that of Py in (|43)). which involve an addition 
and application of Obs and Marginal. We have by a combination of (f7Tj) . ([72]) , and Lemma [24] 
(for the addition, Obs and Marginal operations) that 

p ij ^e+3-'3 2 «'< h - 1 )£ p ih which implies P„ ~ 3 2*'h £ Pij, (73) 

as claimed. 

Further comparing our definition of Qi k in ([43]) to that of Q* k in ([47]) , we have 

Qik ^ c +3"'max{3 3 »'fc',3 3 '''<fc- 1 >}B Qik, which implies Q ik « 3 2 K 'max { h, (h'+i)} £ Q* k , (74) 
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where we have used {72]) and the hypothesis Qji ~^ K 'h' e Qj% along with Lemma [2J] to account 
for the addition, Obs, Marginal and Round e operations used in (|43| and ([47]) . Similarly, we have 



Qil ~ 3 2 K 'm a x{h, + Qii- (75) 

Next, note that the error function i2y (as we defined it in (|37jl ) satisfies the recurrence 



Rij(Qji,S) = R k i(Qik,S k ) +Ru{Qu, Si) +Tr ^(A' Vi [V? V?]) with 
V? = T l3 \ L* r and A' y , = Obs^Ay, + Pu + Pki + Qji, U LfcJ . 



(76) 



We are going to prove the induction step for h > 1 by comparing the corresponding terms 
in ([76]) and (|44|) . By the induction hypothesis for message &k-yi, which is of height h — 1, we 
have 

Rki(Qik,Sk) ^ /g2»e'max{max{h,(h'+l)},ft-l}+2ie'(h-l) e N 



a k ^i(Pki,Qik,S ik ,Nk) (77) 

< exp(3 2K ' max{ ' l '^ i}+2K '' l e), 



where we used ([74]). Similarly, by the induction hypothesis for Q!(_>j (of height h— 1) and by ([75]) 
we get 

f"^'' 5 ^ < exp(3 2K ' max {' 1 ''' l >+ 2 ' i '' l e). (78) 

ai^-i(Pu,Qii,Su,Ni) 

It only remains to compare the trace terms in ([76]) and ([44]) . However note that by f|53f) (for 
addition) and ([ST]) (for Obs) combined with (|71|) . (|72|) and our hypothesis that Qji ~ 3 2 K 'h' e Qji, 
we get 

Ayr ~ max r 3 5 S '(»-i) i A^, which trivially implies A'y. ~ 3 2 K '»ax{h',h}+2 K 'h e Ay s . (79) 

Finally by ([79]) . ([ST]) (for Obs) and Lemma [25] (to account for the trace of the inverses), the 
ratio of the trace terms in (1761) and (l44l) is at most 



< exp(3 2K ' max{ ' lVi}+2K '' l e). (80) 



Combining (|77|). (|75|) and ([50]) we get (J57J as desired. ■ 

Lemmas [25] and [27] show that the error introduced due to rounding off each element of the 
precision matrices can scale exponentially in the height of the tree-decomposition. In general, 
tree-decompositions can be unbalanced, i.e., have height f2(n) (as in the case of, e.g., a simple 1-D 
chain), implying that the error can scale exponentially in n. However, we will see in Section [3.1.1l 
that there is an algorithm due to Bodlacndcr which always produces balanced tree-decompositions 
of height O(lnn), thereby allowing the error to be bounded as a polynomial in n instead. 
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3.1.1 Message Passing for GFFs Using Balanced Tree-decompositions 

We first state Bodlaender's construction of a balanced binary tree-decomposition of the graph. 
Lemma 28. (Theorem of \Bodlaendei\ \l98&) and Theorem 1.1 o ftBodlaendek tl99& )) There 



exists an algorithm which, given any graph of tree-width k onn nodes, constructs a tree- decomposition 
having width k' < 3k + 2, height at most 2|~log 5 / 4 (2n)] , size at most 20n and satisfying our re- 
quirements in Note[l] (see vaae \13\) . The running time of the algorithm is n ■ 2°( K ). 



The usual constructions (see, e.g., Bodlaender ( 19961 ) ) can pro duce tree-decompos itions having 



height ( and diameter) linear in n. Bodlaender's transformation ( Bodlaender , 1988() . based on a 



result of iMiller and ReiJ (Il985lh takes a tree-decomposition as input and produces a wider but 
shallower decomposition (of height logarithmic in n), which means the round-off error due to 
approximations in message passing would be smaller than that on the original tree-decomposition. 
Hence, by Lemma [28l we can assume that each cluster Vi has size at most k' + 1 < 3k + 3, that 
the number of nodes in the tree T is m < 20n, and that the height of T is at most 2|~log 5 / 4 (2n)] . 
This means that for GFFs, the element- wise rounding scheme for precision matrices yields an 
FPTAS. 

Theorem 29. There is a dynamic programming algorithm which, for any k, and any GFF on 
a graph on n vertices with tree-width bounded by k does the following. For any | > e' > and 
budget b, it outputs a set Sf, £ i such that 



eYr(S b s') <(l+e') min err(S). 

K ' ' \S\<b v ' 



The algorithm runs in time 



We remark that the In max -t'^> eE r ' J term in the time complexity can be thought of as the 
number of bits required to describe the input GFF. 
Proof of Theorem I 



Consider the tree T produced by the construction in Lemma [28l We set e = 4 ^ 2 n)' UUK an< ^ run 



the algorithm given by (|44]) - (|47]) and p8j) on tree T, with the transformation Round 6 and the 
e-nets \lY xV } as defined respectively in (|50]) and in (J49J) . Recall that the cluster V m as per 
our assumption is empty, i.e., V m = 0, and is a leaf with neighbour (say) Vi so that Ai m = Vi. 
Our output, Sb, e ', is simply the set of observations extracted from the approximate message 
c\i-^. m {P* m , Qmi, Sf m , b) where Q m i is an all-zeros matrix and where 

Pimi^im = ar § mill c\i^ m (Pi m ,Q m i, Si mi b\ . 

Pirn, S, m CA, m 

Let the optimal solution be and let P* m be the precision matrix of XA jm = X^a given 
observations S^. Now given that we set e = 4 ^yuo K and that height of T is at most 2[log 5 / 4 (2n)] , 
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we have 



err(56 >£ ') = -Ri m (Qmi, Sb, £ >) (by the definition of R im , see (J37J)) 

< exp(e'/4) —ai- >m (P* m ,Q m .i, S* m7 b) (by Lemma l27l and our choice of e) 

n 

< exp(e'/4) -&i-^m(Pim,Qmi, £& n Ai m , b) (where P im is as defined in 

n 

Lemma |2"61 for arguments P* m , Q m i, 5& H Aj m , 6) 

< exp(e'/4)i ( exp(e'/4) ct^™ (/>*„, Q im , S b * n A im , 6) 



(by Lemma [26] and our choice of e) 

< (1 + e') -ai^ m (P* m ,Q mi , S; n A im ,b) ) 
n / 

= (l+e')crr(5 6 *). 



As for the running time, first recall that the time required to construct T is n2°^ K/S " > by 
Lemma 1281 Next, note that all precision matrices produced by message passing have supports 
of size at most n' x k? < 16k 2 , and hence the size of each e-net used (by Observation (|22)1) is at 
most 

2 + llnf^ maX ^>n r " 
which, given e = 4(2^)240* , is of the order 



max {i }eE rij * 

n. In I n L ^ J 

mm^j 



Hence using sparse representations for matrices, we can perform the message passing step (|4"4"]) 
for each edge in T in time 

\ V min^-ry J J 
which gives the claimed time complexity since by construction T has m < 20n edges. ■ 



3.2 Approximate message passing for Gaussian MRF 

In this section we describe a rounding scheme for general Gaussian MRFs on bounded tree- 
width graphs which is based on singular value decomposition and on the usual ordering X of 
positive semidefinite matrices. The element-wise rounding of precision matrices for GFFs (see 
Section [XT]) does not work for general Gaussian MRFs whose precision matrices can be arbitrary 
positive definite matrices instead of simple graph Laplacians. Note that we will be using the 
same approximate message passing scheme given by (|4"4"]) - (|47]) and ([48]) as for GFFs, except that 
the transformation Round e and the e-nets for precision matrices will be different. 
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Our analysis of the error is going to be on similar lines as that for GFFs in Section 13.11 
However, the size of the e-nets is going to depend polynomially on the condition number of 
the precision matrix A. As a consequence, our proposed algorithm does not yield a FPTAS for 
Gaussian MRFs with arbitrary precision matrices unless we impose restrictions on the condition 
number. In contrast, the size of the e-nets for GFFs (based on element-wise rounding of the 
precision matrices) scales polylogarithmically in the size of the description of the input. 

Before we describe the rounding scheme for general Gaussian MRFs, we point out that for 
the special case of trees (k = 1), any Gaussian MRF is equivalent to a GFF — see Lemma I3TJ1 
below. Hence we get a FPTAS for general Gaussian MRF on trees as a corollary of Theorem |2T>1 

Lemma 30. Consider any Gaussian MRF~K = (X±, X 2 , ■ ■ ■ , X n ) on a tree T with n vertices such 
that no 2 variables are independent, i.e., for each pair Xi,Xj, ~E\XiXj] ^ 0. Then there exists a 
vector w g W l and a GFF (Yi, Y 2 , . . . , Y n+ k) on a tree T' where k < n (i.e., T' has at most 2n 
vertices) such that (wiXi, . . . , w n X n ) has the same distribution as that of (Yi, Y%, . . . , Y„) 

given that (Y n +i, Y n+2 , ■ • ■ , Y n+ k) are observed. 

Proof of Lemma I30t 

As usual, A denotes the precision matrix for X. Assume w.l.o.g. that T is oriented such that 
vertex j is a child of vertex i only if i < j. We set the values u)\, w%, ■ ■ ■ , w n in the following 
sequence. We first set ui\ such that for each child j of 1, | A j^^ | < A[j,j]. Consider any vertex 
i > 1 such that Wi has been set, and let i have I children, 31,32, ■■• ,31 > i. We set Wj x , Wj 2 , . . . , Wj l 
(i.e., we scale each Xj t by Wj t ) to be such that: 

• for each t < I, < 0, i.e., each off-diagonal entry in row i is non-positive after scaling, 
and 

• row i is diagonally dominant after scaling, i.e. 




and moreover 

. for each t = 1,2,..., I, | ALM| < A[ Jt , Jt ]. 

We stop after having assigned a value to each u>i . This means that the variables (w\Xi , W2X2 , ■ ■ ■ , w. 
have a diagonally dominant precision matrix, say A', with non-positive off-diagonal entries. 

Now, for each row i of A' which is strictly diagonally dominant (i.e. the row sum is positive), 
we add a new node to the tree T and connect it to node i by an edge. The tree T' thus obtained has 
n + k nodes, where k < n. Note that A' is a principal submatrix, A"[{1, 2, . . . , n}, {1, 2, . . . , n}], 
of a Laplacian A" of size n + kxn + k on the tree T', where the rows and columns of A" indexed 
by n + 1, n + 2, . . . , n + k are defined as 

!— J2i< n i] if i is the only node to 
which j has an edge in T' 
otherwise. 

We complete our proof by letting (Yi, Y2, . . . , Y n+ k) be the GFF defined by the Laplacian A" 
on T', and observing that the principal submatrix A"[{1, 2, . . . , n}, {1, 2, . . . , n}] is the precision 
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matrix of (Y\ , Y 2 , . . . , Y n ) given that Y n+ \, . . . , Y n+ k are observed. Note that we need the as- 
sumption that no 2 variables in the original Gaussian MRF are independent only to make sure 
that the tree T' is connected — this is a requirement in our definition of GFFs. ■ 

Before we describe our rounding idea for larger tree- widths, i.e. k > 1, we make the following 
observation about the eigenvalues of all precision matrices obtained while running the ideal 
message passing algorithm (see (001)- (H3) and (|3"5)0 . We need this property since our construction 
of the e-nets for precision matrices requires the eigenvalues of the latter to be in a bounded range. 

Observation 31. Consider any edge {i,j} in the given tree- decomposition. Consider precision 
matrices P,Q€ ^™ xrl suc h that there exists some set S C Ay and some integer N for which 
a.i^j[P,Q 1 S, N] < +oo. Then P and Q both have support (Ay^S 1 ) x (Ay^S 1 ) and rank |Ay \S\. 
Moreover, A ""^ (A) < A min (P) < X max (P) < A max (A) and A "";' i (A) < X min (Q) < \ ma x(Q) < 

Xmax (A) • 

Observation [31] holds because of the following 2 reasons. First, the way we split the overall 
joint precision matrix A into factors for each cluster (see (|32p in Lemma IT4")) ensures that each 
factor Ay; has rank | Vi | and smallest non-zero eigenvalue at least Am " i ( A ) . Second, in the message 
composition steps (|42|) and p3|) of our ideal algorithm, we use the transformations Obs and 
Marginal which, by Lemma \W\ do not make the smallest non-zero eigenvalues any smaller. 

Definition 32. Given 2 matrices Q, Q' <G X? Xn , and any e > 0, we will say that Q' w e Q iff 
e' e Q d Q' d g 6 Q. 

As with the element-wise rounding for GFFs, we will need the following facts about rj e in 
order to analyze how round-off error accumulates during message passing. 

Observation 33. For any set V, ifQ,Q' € XY xV are such that Q' rj e Q, and Q[V,V] has 
full rank (i.e., \V\), then 

e~ e Tr(Q[F,F]- x ) < Tr (Q'[V, V]- 1 ) < e e Tr (Q[V, V]- 1 ). 

Observation 34. If Q,Qi,Q2 € ;f™ xn and Ei,S2 > are such that Q\ w £l Q and Q2 ~ £2 Q, 
then Qi « ei + £2 Qi- 

Observation [551 follows fro m the fact that if Q ~ R Q' then Q[V, V] w £ Q'[V,V] as well, 
and hence by Corollary 7.7.4 of Iriorn and Johnson! (|l985[ ). the set of eigenvalues of Q[V, V] and 
Q'[V, V] are within a factor e ±£ of each other. Observation I3"4l is a consequence of transitivity of 
positive semidefinite ordering. 

Lemma 35. Consider any e > 0. Then for any set V and any Q,Q' G Xl xV such that 
Q 1 ~ e Q, we have 

(VOcy) Obs(Q', O) ~ £ Obs(Q, O), and (81) 

(VA C V) Marginal y A (Q') w e Marginal y A (Q) . (82) 
Further, for any Qi, Q2 and Q[, Q' 2 in X™ xn such that Q[ sa e Qi and Q' 2 ~ £ Q2, we have 

Q'l + Q'2 ~e O1+Q2. (83) 
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Proof of Lemma 

Equation flSTJ follows from the fact that -< A impl ies -< A\V,V]. Equation (|82|) follows from 
the fact that if A X B then B^ 1 ^ A^ 1 (see, e.g.. iHorn and Johnson! (|l985h . Corollary 7.7,4). 
Finally, equation ([531) follows from the fact that r< A and < B imply -< A + £?. ■ 



Next we describe how to construct, for any set V and any i > e > 0, an e-net for X 



VxV 



denoted by I^ xV . We will be rounding the precision matrices obtained during message passing by 
first p erforming a singular value decomposition or SVD (see, e.g., Chapter 7.3 of lHorn and Johnson! 
(ll985l) L and then separately rounding the orthogonal matrix and diagonal matrix (containing 
the eigenvalues) thus obtained to e-nets for orthogonal matrices and for diagonal matrices re- 
spectively. 

The first ingredient in our construction is an e-net 1Z kx for all k x fc orthogonal matrices. 



Lemma 36 

such that 



For any k and 



> e > there exists a set 1Z 



kxk 



of orthogonal matrices 



,fe-l\ k 



fc(4V2+4) 

\K k e xk \ < (2ke £k/2 (2/e] 
and such that for each fcxfc orthogonal matrix U, there exists a U" £ TZ kxk which satisfies 

(Vt) (U"[:, i]YU[:,i] > 1 - 12.5fce and (Vj ^ i) (U"[:, »])*!/[:, j] < 10(^2 + l)Vke. 



(84) 



Intuitively, Lemma (35] means that columns of the approximation U" of an orthogonal matrix 
U can be obtained from the columns of U by small rotations. We defer the proof of Lemma [35] 
until later. The second ingredient in the construction is an e-net for eigenvalues, which are 
between -^A min (A) and A max (A), defined as 



C F = 



K = 



^min (A) £ ^min (A) 



2s ^min 



(A) 



A m ax(A) 



(A) 

m 



where 



— In ( 



Amm (A) 



(85) 



We are now ready to define our £-net for precision matrices. 



Definition 37. For any set V, we define I I 



VxV 



c x 



VxV 



+ 



to be the collection of all matrices A 



,\V\x\V\ 

"El 

diagonal matrix whose diagonal entries belong to the set C e2 , with 

2 9 



with support VxV such that A[V, V] = R D R l where R e TZ^ 

i (A) 



and where D is a |V| x |V| 



si 



A,, 



mX max (A)J 10 4 \V\ 



Clearly, 



|^|V|x|V|| 



and £2 = e/2. 



m 2 |1/| 3 /A roax (A) 



^min (A) 



(86) 



(87) 



We have chosen the parameters e±, £2 to be sufficiently small such that for any precision matrix 
P € XY xV of rank \V\ with eigenvalues in the range [ A ""^( A ) , A maa; (A)] , there is a matrix P' 
mlY xV which is e-close to P, i.e., P w £ P' . However, by Observation [311 au precision matrices 
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P produced while message passing have eigenvalues precisely in the above range. Hence lY xV 
is indeed a sufficiently fine e-nets for our approximation purposes. We next formally define the 
transformation Round £ which, as one would expect, works by first performing a SVD of the input 
precision matrix and then rounding the resulting orthogonal matrix and the diagonal matrix. 

Lemma 38. For any e < j there is a matrix transformation Round £ with the following property. 
For any set V and for any P G X+ xV of rank \V\ satisfying A ""^( A ) < \ min (P) < X max (P) < 
^maxi-k)? we have that Round e (P) <G xY xV and that Round e (P) w E P. Moreover, Round e (P) 
can be computed from P in time 0(\V\ 3 ). 

Proof of Lemma I38t 

We compute P' = Round £ (P), P e X+ xV as follows. First compute the SVD of P[V,V] = 
UDU*, where D is a diagonal matrix and U is a \V\ x \V\ orthogonal matrix. Note that the di- 
agonal entries of D, namely the eigenvalues of P, are in the range [A m i„(A)/m, A max (A)] as per 
our assumption in the lemma statement. Second, compute diagonal matrix D' by rounding-off 

each entry of D to the nearest element in C £ /2 , i.e., for each i, D'[i,i] = arg min r — D[i, i] \. 

rec E/2 

Next approximate U using an orthogonal matrix U' € , where E\ is as defined in (|86[) . 

using the construction in Lemma l36l Finally define P'[V, V] = U'D'U n . We can clearly perform 
all the steps of the computation in 0(|V^| 3 ) time. 

To prove Round e (P) « e P, we show that U'D'U n = P'[V,V] ^ c £ P{V,V] = c £ UDU l . 
The proof of the other inequality, i.e., e~ £ P[V, V] ^ P'[V, V], is identical and omitted. To prove 
U'D'U' 1 < c £ UDU\ it is sufficient to prove that 

e £ D - U^U'D'U'^U 1 )- 1 h 0. 

However the l.h.s. of the above can be simplified as 

c'D-U^U'D'U'^U 1 )- 1 

= e e D-{U*U')jy{U t U') t 
= (e £ D - RD'R 1 ) (substituting R = XJ l U') 
= (e £ D - D') + (D' - RD'R*) 

h {c £ ' 2 D' - D') + (£>' - RD'R 1 ) ( since (Vi) < c f 

h -D' + (D'R n + R'D' + R'R n ) (substituting R' = R - I {vl ) 
h -D' + D'R n + R'D'. 
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To see that §£>' + D'R n + R'D' y 0, note that for any vector z of size \V\, 
**(^iy + D'R' t + R'D'^z 

>- Efl-^W. 2 



2 2 

10(V2 + l)v/i^5^(£>'[i,i] + £>'[j, j])NN (89) 



^ 4j7[ E [* > l l z ^ + D ' U> 3] A - 2 min {D 1 [i, i] , D' \j, j] } | z t \ \ z 
> 0, 

where in the hrst step we applied Lemma 1551 ([51)) to matrix R' = R — L v i = U t U' — Lyi, and in 
the second step we substituted the value of £\ given by (|86|) . Hence P'[V, V] ^ e £ P[V, V}. ■ 

Finally we give a proof of the construction of our e-net for orthogonal matrices, Lemma 1361 
which requires an e-net for unit vectors. 

Lemma 39. For any k and e > 0, there exists an e-net U k for k- dimensional unit vectors where 

\U*\ < 2kc ek ' 2 (2/e) k ~\ 

and where 

F 2 

(Vz s.t. ||z|| = 1) (3 z' e U k ) ||z'-z|| < e and zV > 1 - — . (90) 
Proof sketch for Le mma I39t 

See, e.g., Lemma 5.5 of iDevrove et al. I dl996ft . ■ 

The following lemma will also be useful in the proof of Lemma 1361 

Lemma 40. For any 1 > e > 0, if Ui, V2, Zi, Z2 are unit vectors such that \1\u2 = 0, ||ui— zi | < e 
and ||u 2 - z 2 || < e, then \z\z 2 \ < {2y/2 + 2)e. 



Proof of Lemma 

By triangle inequality, 

||zi-z 2 || < ||ui - u 2 || + j|ui - zi|| + j|u 2 - z 2 ||, 

(91) 

< (V2 + 2 £ ), V ' 

where in the second inequality follows from the fact that ||ui — u.2 1 = V2, 1 1 ui — zi| < e, and 
j|u 2 — z 2 j| < e. A similar application of the triangle inequality gives 



zi-zall > (V2-2e), (92) 
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After taking squares of both sides of ([9"T ]) and ([52" ]) . we get \z\ z 9 1 < 2V2e+2e 2 < (2%/2+2)e. 



Proof of Lemma 

j^kxk C0Ils i s ts of all (orthog onal) matrices obtained by applying Gram-Schmidt orthonormaliza- 
tion (see, e.g., Chapter 0.6 of lHorn and Johnson! (1985;)) to any k x k matrix with columns chosen 
from li k . Clearly 



\K k E xk \ < \U*\ k < {2ke ek / 2 (2/ef : 

Now consider any k x k orthogonal matrix U with columns U[:, 1], U[:, 2], . . . , [/[:, k]. The matrix 
U" in ([54]) can be computed from U in time 0(fc 3 ) by first rounding-off each column U[:,i] 
to vector € U k , and then applying Gram-Schmidt orthonormalization to obtain columns 
U"[:, 1], U"[:, 2], ... , U"[:, k]. For the rounding-off step, we have by Lemmas 1591 and I4TJ1 that 

(Vi) ||u' t - [/[:,*] || < e and (Vj ^ i) |(uJ)Xl < (2>/2 + 2)e. (93) 

Next we analyze the Gram-Schmidt orthonormalization process. Consider for any i the pro- 
jection of onto the span of u' 1; u' 2 , . . . , uj_ 1; given by 

uJ-||=5> u J- (94) 



First 



Knii 2 = E c2 Kn 2 + E 



- E C J + ( 2 V2 + 2)e E 2c n c j2 (using (JMJ) 

31, h& 

> ^(l-( i -2)(2V2 + 2)e)c J 2 + (2V2 + 2) £ £ ( Cjl - c 

> (l-fc(2^+2)e)^ S 2 , 



2 

J2 J 



J<1 

which, along with the hypothesis that fc ^ 4v /2 +4 - ) > £ i implies 

llu',,11 2 1 

y c 2 < LiE < i- < 2. (95) 

f£ J ~ (l-fc(2V2 + 2)e) " (1 - k(2V2 + 2)e) ~ 

Second, by ([93]) we get 



K||)*uj| = IE c *K)^l 

j<i 

< (2V2 + 2)eY, 



(96) 



< (2^+2)e^2(i- 1) 

< 2(2 + V2)Vke, 
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where in the penultimate step we used the fact that given (|9"5"j). X)j<il c jl can De at most 
y/2(i — 1). The column U"[:,i] produced by Gram-Schmidt is U"[:,i] = j^tt^j- Hence us- 
ing O we get ([/"[:, > K - u j||)' u 'i > 1 - 2(2 + V^)Vke, which in turn implies that 

\\U"[:,{\-^\\ < 2^(2 + V2)Vk~e. (97) 
Applying the triangle inequality along with (|9"3")l . (|9T| we obtain 

- U[:,i}\\ < \\U"[:,i]-v^\\ + \\v^-U[:,i\\\ < 2^/(2 + V2)Vfe + e < 5Vfci. (98) 
It follows from (jgSJ) that (C/"[:, i])*Z7[:, i] > 1 - 12.5te. Combining Lemma HOI with (l9"51) gives 
(V 4 ^ j) (Cr"[:,i])*l7[:,j] < 10(V2 + l)Vte. 

■ 

Having described how to construct e-nets, we are finally ready to analyze how the error due to 
rounding accumulates during message passing. Lemma I4T1 states that the approximate messages 
are not much bigger than the ideal messages, and Lemma [27] shows thats the true error of the 
approximately optimal set of observations (extracted from the approximate messages) is not 
much bigger than the approximate error (given by approximate messages). 

Lemma 41. Consider any e > and any message on^j of height h. For any Pij,Qji,Sij,N 
for which oti^,j(Pij,Qji,Sij,Ni) is finite and for any Qji ~h's Qji there exists Pij ~/j e Pij such 
that 

&i->j(Pij,Qji,Sij,Ni) < eiep((mait{h',h} + h)£)ai^.j(Pij,Qji,Sij,Ni). (99) 

Lemma 42. Let a^j be a message of height h, and consider any Pij, Qji, Sij, Ni for which 
(Xi-tj(Pij,Qij, Sij, Ni) is finite. Then the set of observations S C Vij, obtained by solving the 
approximate dynamic program (|44|) -(|47 j) on subtree Tij, is such that 5 fl Ay = Sij, \S\ < Ni, and 
for any Q yi tt h , £ Qj it 

Rij{Qji,S) < exp((max{h',h} + h)E)ai^j(Pij,Qji,Sij,Ni), (100) 
and such that the distribution on Xa^ given observations in S has precision Pij in T^, where 

The proofs of Lemmas [41] and [42] are almost exactly identical to that of Lemmas [26] and [27] 
respectively. The only difference between the error analysis for the element-wise rounding for 
GFFs and the SVD-based rounding in this section is that while the element-wise rounding error 
increases by a constant factor due to application of transformation Marginal, the SVD-based 
rounding error remains unchanged — compare (|52[) to (|82[) . As a result, the error of the element- 
wise rounding increases exponentially with the height of the message ( Lemmas [26l and [27|) whereas 
the error of the SVD-based rounding scales linearly. Hence we only provide a brief sketch for the 
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proof of LemmaHTl (to illustrate the above mentioned difference) and omit the proof of LemmaH^l 
altogether. 

Proof of Lemma I4U 

Proof is by induction on the depth ft. The base case, h = 1, is identical to that in the proof of 
Lemma 1261 

Consider a message a^j of height h > 1. Let P ki , Q* k , S* k , N k , P ; * , Q* u , S* t , , L*. be the 
optimal choice in (gDJ for Py = Py, Q^j = Q^-, £y = Sy, iVf = JV,-. Let S ife = S^, S« = 
A> fe = JVjf, A>, = iV*, and Ly = Lt.. 

Consider matrices Pki and p, in (|44p such that 

PH~{h-i)ePki and P« « (h _ 1)e P,:, (101) 



and let Qik , Qiz , and Py be given respectively by (J47J) and (|46[) . 

Now Qik is obtained by first adding matrices Avj, Pu,Qji, and then successively applying 
transformations Obs, Marginal and Round £ . By definition, Pu ~(/j_i) £ P ; * and by hypothesis, 
Qji ~h'e Qji- Hence by <JS3j) , 

(Av< + Pli + Qji) ~max{h',(h-l)}e (Ay, + P*i + Qji)- 

Since application of Obs and Marginal do not increase the rounding error (see (|8Tj) and (|82|) 
respectively), we have 



Marginal yA(i g sjU£ij)i A . fe ^. j , ( Obsf A Vi + P, + Qji, Sy U L l 

~m^{h',(h-i)}e Marginalvj^^u^^ Azk \s* k (obs(^A Vi + P t * + Q jh Sy U L*X\ < " 102 ^ 



Q 



ih' 

Applying Round e to the l.h.s. of f|102[) gives us, by the "triangle inequality" (i.e., Observa- 
tion [MP that 

Q lk = Round £ ^Marginal yA( g ijU£ij)i A . k \g. k ^Obs^Ay + P H + Qji, Sy U Ly 

~max{(/i' + l),h}e Qik ■ 

By a similar analysis, using Lemma 1351 and Observation 1341 we have 

Qil ~max{(/i'+l),ft}s Q*il an( l Aj P%j- 

To complete the induction step we compare the corresponding terms on the r.h.s. of equa- 
tions (j4"4"|) and (|4"0|) . Since Qifc ~max{(/i'+i),ft}e *3ifc> ^y induction hypothesis there exists Pki 
which satisfies both (|101[) as well as the following: 

^i(P^,Qt k ,S* ik ,N*) ~ PU 1 U >' m n 1 » > (103) 
< exp (( max{ft', ft,} + h)e) . 
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Similarly since Qu ~ ma x{(fe'+i),/i}£ Q*ii by induction hypothesis there exists Pu which satisfies 
both (jTOTj) and 

~ Tp*~n* q* n*] ^ e X p((m ax {h,h} + h)e). (104) 

Only the trace terms in (|44|) and (|40|) remain to be compared. It follows from combining (|101[) 
and the hypothesis Qji ~h'e Qji with Observation 1331 and Lemma 1331 that the ratio of the trace 
terms, 



TrUAyK,^])" 1 



Tr((A> Vi [V!,V>]y 
Combining fT03]l . ([T04j) and (fT05|) we obtain ([99 



< exp(max{/i',(/i- l)}e). (105) 



Since the error scales linearly in the height of the message, we do not need to run our message 
passing algorithm on shallow tree-decompositions (unlike GFFs for which we used Bodlacndcr's 
construction, see Lemma [28)) in order to get a FPTAS — we can use any tree-decomposition. Our 
main result in this section can be stated as follows. 

Theorem 43. There is a dynamic programming algorithm which, for any k, and any Gaussian 
MRF on a graph of n vertices with tree-width bounded by k does the following: for any 1 > e' > 
and budget b, it outputs a set Sb.e' such that 



The algorithm runs in time 



:(S bE >) < (1 +s') min err (5). 
v ' ' \S\<b v ' 



fAU° (K) 



^■min (A) 

Note that the running time scales as polynomial in the condition number of the input precision 
matrix A for bounded tree-width graphs. This means we obtain an FPTAS if the condition 
number is bounded, e.g., by a polynomial in the size (i.e. number of bits) of the description of 
A — this however is not generally the case. 
Proof of Theorem |43) 

Consider a tree-decomposition T having width k' = K and m < 2n+l clusters, and which satisfies 
our requirements in Note [T] (see page [13]). We ca n construct such a T in time n2°( K ) by using 
the algorithm of Bodlaender (see Theorem 1.1 of Bodlaender ( 19961) ). 



Let £ = 4(2n+i) • We run the algorithm given by (|gi ]) -(|47 ]) and (gSJ) on tree T, with the 
transformation Round e and the £-nets {TY xV } as defined respectively in Lemma ([38]) and in 
Definition l37l Recall that the cluster V m as per our assumption is empty, i.e., V m = 0, and 
is a leaf with neighbour (say) Vi so that Ai m = Vi- Our output, Sb, s 'j is simply the set of 
observations extracted from the approximate message ai^ m (P* m ,Q m i, S* m ,fy where Q m i is an 
all-zeros matrix and where 



Pim: ^im — arg mill Q.i—^ m , (Pirm Qvaii ^im^ o) ■ 

Pim, S, m CA, m 
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Let the optimal solution be S£ and let P* m be the precision matrix of XA jm = given 
observations S£. Now given our choice of s = 4 ^ 2 n+i) anc ^ &i ven that height of T is at most 2n, 
we have 

err(Sb, e ') = —Rim{Qmi,Sh e ') (by the definition of Ri m , see ([37])) 
n 

< exp(e'/4) — <y^m(Pimi Qmi: S^ rn , ^) (by Lemma 1421 and our choice of s) 

n 

< cxp(e'/4) —&i->m(Pim,Qmi, St fl A im , 6) (where P mi is as defined 

in Lemma I4T1 for arguments P* m , Qmi, S£ H Aj m , 6) 

< exp(e'/4)i [ exp(e74) a;_> m ( P im, Qmi, S* b n A im , 6) 

(by Lemma 2T] and our choice of e) 

< (1 + e') -Cti^fP^, Q m i, SI n A lm , 6) 
= (l + e')en(SZ). 

As for the running time, first recall that the time required to construct T is n2°( K3 \ Next, 
note that all precision matrices produced by message passing have supports of size at most 
k x k < k 2 , and hence the size of each e-net used (as per (|H7| ) is at most 



O 



n 2 K 3 ( A maa (A) 



whicn, given « . jrfTIT , is of the orde, ( (feg) J). Hence nsing s P ™ r o pr o- 

sentations for matrices, we can perform the message passing step (|44[) for each edge in T 

in time 

nK ^max (A) ^ 



£' -^min(A) 

giving the claimed time complexity since by construction T has at most 2n edges. 
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